
Alternate fix for conversations that stop responding due to stale execution status #2470

Open
DoubleDensity wants to merge 5 commits into OpenHands:main from DoubleDensity:stuck_conversation_fix_alternate

Conversation

@DoubleDensity

… RUNNING reset already happens outside the loop at lines 586-592, so we shouldn't check for STUCK again inside the loop and break.

Summary

[fill in a summary of this PR]

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the GitHub CI passing?

@DoubleDensity
Author

also did not fix the issue after further testing -- please do not merge yet

@DoubleDensity
Author

this one looks promising, was able to resume a previously hung base_state.json at least. will continue testing to see if any stuck sessions occur again

@DoubleDensity
Author

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 17, 2026

I'm on it! DoubleDensity can track my progress at all-hands.dev

@openhands-ai

openhands-ai bot commented Mar 17, 2026

Summary

I analyzed PR #2470 "Alternate fix for conversations that stop responding due to stale execution status" and provided a code review using the /codereview-roasted persona.

What I did:

  1. Retrieved PR details and diff using GitHub CLI
  2. Analyzed the code changes in local_conversation.py
  3. Provided a code review with:
    • Verdict: ✅ Worth merging - the core fix (changing continue to break when stuck is detected) correctly addresses the infinite loop bug
    • Key finding: The original bug was that when stuck was detected, the code set status to STUCK and used continue, which would loop back. With STUCK removed from the break condition, it would keep looping infinitely. The fix correctly breaks out of the loop.
    • Minor improvement suggestions around documentation clarity and potential extraction of stuck detection logic into a helper method

I did not modify any code - the task was to provide a code review comment as requested by the /codereview-roasted command on the PR.

@xingyaoww
Collaborator

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

🔴 Needs improvement — Patching symptoms, not the disease

Linus-Style Analysis

[CRITICAL ISSUES]

1. The author already told us it doesn't work

"also did not fix the issue after further testing -- please do not merge yet"

Let me repeat that: the person who wrote this PR says it does not fix the problem. Why are we reviewing code that the author has disowned? This should be closed or converted to a draft, not reviewed.

2. [local_conversation.py, Lines 637-656] last_event_count tracking is a band-aid on a band-aid

The original code's continue → STUCK check → break two-step dance is admittedly ugly. Replacing it with a direct break is fine — that's the one good idea here. But then you bolt on last_event_count tracking in three separate places (line 595, ~654, ~671) to prevent is_stuck() from being called "too often." This is solving an imaginary problem while the real one walks past you waving a flag.

The is_stuck() detector already has its own thresholds and only scans the last 20 events. Calling it once per loop iteration is cheap. The real question you should be asking is: why does the conversation stop responding after stuck detection? Adding event-count guards doesn't answer that.

3. [local_conversation.py, Lines 646-656] The break creates a new problem

The original flow was: detect stuck → set STUCK → continue → next iteration checks STUCK → break. Your fix: detect stuck → set STUCK → break. Same outcome, cleaner path — good.

But here's what you missed: after run() exits via break, nothing resets the status to IDLE. Look at the flow:

  1. Stuck detected → status = STUCK → break out of loop
  2. run() returns
  3. User sends new message via send_message() → sees STUCK → resets to IDLE ✓
  4. User calls run() again → sees STUCK at line 586-592 → resets to RUNNING ✓

OK, that part is actually fine. But the original code had the same path (continue → break), so your change to break is functionally identical to the original for stuck detection. The only behavioral difference is the last_event_count guard, which skips stuck detection when no new events appear. If the agent is spinning without producing events (e.g., an empty step), you'll never detect stuck. That's a regression.
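The skipped-detection regression can be sketched with a toy loop (hypothetical names, not the actual local_conversation.py code): an agent that spins without producing events never gets past the event-count guard, so the stuck detector is never consulted.

```python
def run_loop(step, is_stuck, max_iters=10):
    """Toy run loop: `step` returns the number of new events per iteration,
    `is_stuck` is the detector. Mirrors the guarded-check shape only."""
    events = 0
    last_event_count = 0
    checks = 0
    for _ in range(max_iters):
        if events != last_event_count:  # the guard under discussion
            checks += 1
            last_event_count = events
            if is_stuck():
                return "stuck", checks
        events += step()
    return "done", checks

# Agent spins without emitting events: the detector is never even called.
status, checks = run_loop(step=lambda: 0, is_stuck=lambda: True)
```

Here `status` comes back "done" with zero detector calls, even though the detector would have reported stuck on its first invocation.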

4. [local_conversation.py, Line 595 / Line 654 / Line 671] Three-way state tracking is a complexity smell

last_event_count is updated at initialization, after stuck check, and after agent.step(). That's three places maintaining the same bookkeeping. When the next person touches this loop, they'll need to reason about all three to understand when stuck detection fires. This violates the "if you need more than 3 levels of indentation, you're screwed" principle — except here it's "if you need a mental flowchart to track one variable, you're screwed."

[IMPROVEMENT OPPORTUNITIES]

5. [PR description] "[fill in a summary of this PR]" — Really?

The summary is literally a placeholder. This tells me the PR was submitted in a rush. If you can't explain what your change does in one paragraph, you probably don't fully understand what it does yet.

[TESTING GAPS]

6. Zero tests for a change to the core conversation loop

This modifies the central run loop — the beating heart of conversation execution. The stuck detection path, the break/continue behavior, the interaction between run() and send_message() across threads — none of it is covered by a new test. For a bug fix that the author says "did not fix the issue," the absence of a reproducer test is damning. How do you know when you've actually fixed it if you can't reproduce it?

7. [PR description] No evidence section

No reproduction steps, no logs showing the stuck behavior, no before/after comparison. We're flying blind.


VERDICT:
Needs rework — The continue → break simplification is the right instinct, but everything else is misguided complexity. The last_event_count optimization solves a non-problem while introducing a potential regression (skipping stuck detection when step produces no events). Most critically, the author confirms this doesn't fix the actual issue.

KEY INSIGHT:
The real bug is likely a concurrency issue between run() and send_message() around the state lock, not the continue vs break choice in stuck detection. Investigate what happens to the state lock when send_message() fires between stuck detection and the loop restart — that's where conversations go to die.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

I completed the /codereview-roasted request on PR #2470 by:

  1. ✅ Retrieved PR context — Fetched the diff, PR metadata, and identified it's from a fork by DoubleDensity.
  2. ✅ Analyzed the full code context — Read the local_conversation.py run loop, the StuckDetector implementation, send_message() interaction, and post-loop code to understand the complete picture.
  3. ✅ Posted a Linus Torvalds-style roasted code review as a PR comment, covering:
    • The author's own admission that the fix doesn't work
    • The last_event_count tracking adding unnecessary complexity (3 update sites) for a non-problem
    • The continue → break change being functionally identical to the original flow
    • A potential regression where stuck detection is skipped when no new events appear
    • Missing tests, incomplete PR description, and no evidence section
    • A hypothesis that the real bug is a concurrency issue around the state lock, not the continue/break choice

No code changes were made — this was a review-only task, which is correct for a /codereview-roasted request. The review was posted at: #2470 (comment)

@xingyaoww
Collaborator

@OpenHands based on this #2470 (comment) can you do an investigation on the actual cause of this issue?

@openhands-ai

openhands-ai bot commented Mar 18, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Collaborator

Investigation: Root Cause Analysis for "Conversations That Stop Responding"

After a thorough investigation of the codebase, I analyzed the run loop, send_message(), StuckDetector, FIFOLock, the ConversationState autosave mechanism, and the agent-server's EventService layer. Here are my findings:


TL;DR

The PR's changes to stuck detection (continue → break, event count tracking) are addressing symptoms, not root causes. The stuck detection flow (continue → reacquire lock → check STUCK → break) is functionally equivalent to a direct break. The real issues lie in several concurrency and design problems across the SDK and agent-server layers.


Finding 1: TOCTOU Race in event_service.send_message() (Agent-Server)

File: openhands-agent-server/openhands/agent_server/event_service.py, lines 308-331

async def send_message(self, message: Message, run: bool = False):
    # Step A: Add message (blocking executor call)
    await loop.run_in_executor(None, self._conversation.send_message, message)
    # Step B: Check status (acquires lock briefly)
    if run:
        with self._conversation.state as state:
            run = state.execution_status != ConversationExecutionStatus.RUNNING
    # Step C: Start run if needed
    if run:
        loop.create_task(_run_with_error_handling())

There is a time-of-check-to-time-of-use gap between Step A completing and Step B checking status. During this gap:

  • The existing run loop could have already picked up AND processed the new message
  • The run loop could have exited (status changed from RUNNING → FINISHED)
  • Step B then sees a non-RUNNING status and starts a redundant run via Step C

While this typically results in harmless no-op runs (the new run() sees FINISHED and immediately breaks), it represents fragile logic that could cause unexpected behavior under specific timing conditions.

Finding 2: Untracked Fire-and-Forget Run Tasks (Agent-Server)

File: event_service.py, lines 325-331

# send_message creates untracked tasks:
loop.create_task(_run_with_error_handling())  # NOT stored in self._run_task

# But run() checks _run_task to prevent double-running:
if self._run_task is not None and not self._run_task.done():
    raise ValueError("conversation_already_running")

send_message() creates fire-and-forget run tasks that are NOT stored in _run_task. This means event_service.run() cannot detect runs started by send_message(), potentially leading to two concurrent conversation.run() calls. The FIFO lock serializes them, so they don't truly run in parallel, but the second run may encounter unexpected state after the first completes.
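A minimal sketch of that suggestion, assuming a single tracked slot for the current run task (illustrative names; the real event_service API may differ):

```python
import asyncio

class RunTracker:
    """Stores the current run task so any entry point can detect an in-flight run."""

    def __init__(self) -> None:
        self._run_task: asyncio.Task | None = None

    def start(self, coro) -> asyncio.Task:
        if self._run_task is not None and not self._run_task.done():
            coro.close()  # avoid an un-awaited coroutine warning
            raise ValueError("conversation_already_running")
        self._run_task = asyncio.get_running_loop().create_task(coro)
        return self._run_task

async def demo() -> bool:
    tracker = RunTracker()

    async def run() -> None:
        await asyncio.sleep(0.01)

    first = tracker.start(run())
    try:
        tracker.start(run())  # second concurrent start is rejected
        rejected = False
    except ValueError:
        rejected = True
    await first
    return rejected

print(asyncio.run(demo()))
```

With both send_message() and run() funneling through one tracker, neither path can silently spawn a second concurrent run.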

Finding 3: Exception Handler Outside Lock (SDK)

File: local_conversation.py, lines 690-705

        except Exception as e:
            self._state.execution_status = ConversationExecutionStatus.ERROR  # ← No lock!
            self._on_event(ConversationErrorEvent(...))

The exception handler in run() sets execution_status = ERROR outside any with self._state: block. If another thread (e.g., send_message()) is concurrently modifying state, this creates a race condition on the status field. The autosave in __setattr__ would persist whichever write happens last, potentially losing the ERROR status.

Finding 4: Non-Atomic File Writes (SDK)

File: io/local.py, lines 62-68

def write(self, path: str, contents: str | bytes) -> None:
    with open(full_path, "w", encoding="utf-8") as f:
        f.write(contents)

_save_base_state() writes base_state.json non-atomically. If the process crashes mid-write (which is plausible during long-running conversations), the file could be partially written, leading to corrupted JSON that fails to load on restart. This would manifest as a load error rather than "stale status," but is a robustness issue.
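The standard remedy, also suggested in the recommendations below, is a temp-file-plus-rename pattern. A generic sketch (not the actual io/local.py code):

```python
import os
import tempfile

def atomic_write(path: str, contents: str) -> None:
    """Write contents atomically: temp file in the same directory, fsync,
    then os.replace(), which is atomic on POSIX and Windows."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # readers see either the old or the new file
    except BaseException:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target, since os.replace() is only atomic within a single filesystem.
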

Finding 5: Status Check Blocks the Event Loop (Agent-Server)

File: event_service.py, lines 313-315

if run:
    with self._conversation.state as state:  # ← Sync lock on async thread!
        run = state.execution_status != ConversationExecutionStatus.RUNNING

This acquires the synchronous FIFO lock directly on the asyncio event loop thread. If the run loop is holding the lock during agent.step() (which includes LLM API calls potentially taking 10-30+ seconds), the entire event loop blocks. This means:

  • No WebSocket messages can be processed
  • No HTTP responses can be sent
  • No other coroutines can execute

This is likely the most impactful contributor to the "stop responding" perception — the server literally cannot respond to any requests while the lock is held during agent.step().
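A hedged mitigation sketch, assuming the lock must remain synchronous: hop to a worker thread for the blocking acquisition so the event loop keeps serving traffic (illustrative names, not the real FIFOLock API):

```python
import asyncio
import threading

state_lock = threading.Lock()   # stand-in for the FIFO state lock
execution_status = "finished"   # stand-in for state.execution_status

def needs_new_run() -> bool:
    # May block for many seconds if agent.step() currently holds the lock.
    with state_lock:
        return execution_status != "running"

async def send_message_check() -> bool:
    loop = asyncio.get_running_loop()
    # Offloading keeps the event loop free to serve WebSocket/HTTP requests.
    return await loop.run_in_executor(None, needs_new_run)

print(asyncio.run(send_message_check()))
```

Only the worker thread waits on the lock; coroutines scheduled on the loop continue to run in the meantime.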

Finding 6: Related Issues - False Positive Stuck Detection

Issues #2482 and #1349 describe scenarios where:

  • Extended thinking models produce reasoning-only responses → emitted as MessageEvent → 3 consecutive agent MessageEvents triggers the monologue stuck detector → status becomes STUCK
  • LLM text responses without tool calls set FINISHED prematurely

These could compound with the above concurrency issues to create the "stop responding" behavior.
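For illustration, the monologue heuristic those issues describe reduces to something like the following toy check (not the actual StuckDetector implementation):

```python
def looks_like_monologue(recent_kinds: list[str], threshold: int = 3) -> bool:
    """Flag when the last `threshold` events are all agent messages with no
    interleaved actions. Reasoning-only responses from extended-thinking
    models surface as plain agent messages, so they can trip this check."""
    tail = recent_kinds[-threshold:]
    return len(tail) == threshold and all(k == "agent_message" for k in tail)

# Three reasoning-only turns in a row look like a monologue:
print(looks_like_monologue(["agent_message", "agent_message", "agent_message"]))
# An interleaved action breaks the pattern:
print(looks_like_monologue(["agent_message", "action", "agent_message"]))
```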


Recommendations (Priority Order)

  1. Move the status check in event_service.send_message() to a thread (Finding 5) — or better, make the run decision inside conversation.send_message() itself while the lock is already held, avoiding the TOCTOU gap entirely (Findings 1 & 5).

  2. Track all run tasks (Finding 2) — send_message() should store its run task in _run_task (or a shared tracking mechanism) so run() can detect and prevent double-running.

  3. Move the exception handler inside the lock (Finding 3):

    except Exception as e:
        with self._state:
            self._state.execution_status = ConversationExecutionStatus.ERROR
  4. Use atomic writes for base_state.json (Finding 4) — write to a temp file then rename (which is atomic on most filesystems).

  5. Address #2482 ([Bug] Remote conversation got stuck: monologue detector false positive on extended thinking models) — Fix the monologue false positive, which is a contributor to unnecessary STUCK states.
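Recommendation 1 can be sketched as follows, with the run decision made while the state lock is still held so the check and the act are one atomic step (hypothetical names; the real conversation API differs):

```python
import threading

class ConversationSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.execution_status = "idle"
        self.events: list[str] = []

    def send_message(self, message: str) -> bool:
        """Append the message and decide, under the same lock, whether the
        caller should start a new run. No check-then-act gap remains."""
        with self._lock:
            self.events.append(message)
            return self.execution_status != "running"

conv = ConversationSketch()
print(conv.send_message("hello"))  # no run in flight, caller should start one
conv.execution_status = "running"
print(conv.send_message("again"))  # run already active
```

Because the status is read in the same critical section that appends the message, the run loop cannot exit between the two steps.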


Why This PR Doesn't Fix the Issue

The PR correctly identifies that continue after stuck detection creates an unnecessary round-trip through the loop. But changing continue to break is functionally equivalent (the original continue → reacquire lock → STUCK check → break path produces the same result). The last_event_count tracking adds complexity without addressing the real problems listed above.

The root cause of "conversations that stop responding" is most likely Finding 5 (synchronous lock blocking the event loop), possibly combined with Finding 2 (untracked run tasks) and Finding 6 (false positive stuck detection). These are architectural issues in the agent-server layer, not in the stuck detection logic.

@openhands-ai

openhands-ai bot commented Mar 18, 2026

Summary

I was asked to investigate the actual root cause of the "conversations that stop responding due to stale execution status" issue referenced in PR #2470.

What I did:

Key findings — 6 root causes identified:

  1. TOCTOU race in event_service.send_message() — Gap between adding message and checking whether to auto-run
  2. Untracked fire-and-forget run tasks — send_message() creates tasks not tracked by _run_task, allowing potential double-running
  3. Exception handler outside lock — run()'s except clause sets ERROR without the FIFO lock
  4. Non-atomic file writes — base_state.json could be corrupted on crash
  5. Synchronous lock blocking the event loop (most impactful) — event_service.send_message() acquires the sync FIFO lock on the asyncio thread; when agent.step() holds this lock during slow LLM calls, the entire server becomes unresponsive
  6. False positive stuck detection — Issues #2482 ([Bug] Remote conversation got stuck: monologue detector false positive on extended thinking models) and #1349 ([BUG] Agent produces MessageEvent without ActionEvent, causing premature execution termination) show the stuck detector triggers incorrectly on extended thinking models

Conclusion: The PR's stuck detection changes (continue → break, event count tracking) are functionally equivalent to the original code and don't address the real problems. The most likely root cause is Finding 5 — the server literally cannot process any requests while the FIFO lock is held during agent.step(). No code changes were made — this was an investigation-only task as requested.

@all-hands-bot
Collaborator

[Automatic Post]: I have assigned @simonrosenberg as a reviewer based on git blame information. Thanks in advance for the help!

@DoubleDensity
Author

Just adding that this fix has held strong under continuous use for 48 hours, have not been able to reproduce the original issue anymore.

@xingyaoww xingyaoww requested review from csmith49 and enyst March 20, 2026 16:57
Collaborator

@csmith49 csmith49 left a comment

Looks good to me, modulo the event count tracking. I can't see a situation where it actually modifies the control-flow, and worry it will act as a red herring when resolving future issues. Will approve this if it's removed or if we can figure out that it's necessary for this fix.

@simonrosenberg
Collaborator

simonrosenberg commented Mar 20, 2026

Just adding that this fix has held strong under continuous use for 48 hours, have not been able to reproduce the original issue anymore.

Do you have a script that runs the agent a few hundred times so that we can reproduce the problem?
If so, can you validate that the event count tracking is indeed necessary to fix the issue, that the stuck part itself is not enough? Thanks!

Edit: if it's a burden to run that many runs with a personal key, I can always take over :) let me know!

@DoubleDensity
Author

DoubleDensity commented Mar 20, 2026

thanks for taking a look at this @simonrosenberg ! I don't have a script or an environment to run automated tests in, so I can only give anecdotal confirmation at the moment. I have been using OpenHands at work these past few weeks, but with limited success because this issue kept breaking my conversations. Since I put this patch in I have been using it non-stop with Nemotron 3 Super 120B on a DGX Spark w/ vLLM, and OpenHands has not locked up once; that was about three days ago now.


# Check for stuck patterns if enabled
# Only check if new events have been added since last check
# to avoid repeatedly triggering on the same pattern
Collaborator

Do you know by any chance what the pattern was? Or which of the stuck scenarios got triggered?

@enyst
Collaborator

enyst commented Mar 21, 2026

@DoubleDensity Thank you for looking into this! I think we fixed just today the cases where Qwen had tool calls in reasoning content, which caused it to get stuck and fail a lot. I don't know if the cause could be similar to what is causing this; just so you know, it was described in this comment and related investigations and PR.

I wonder if you could try running on an updated main?
PR:

The problem there was that the SDK interpreted the messages with empty content and empty tool calls as empty messages (since content and tool calls were 'hidden' in reasoning where we don't read them), and then 3 empty messages trigger the STUCK detector.

@DoubleDensity
Author

thanks @enyst I'll test it out!

@DoubleDensity
Author

hey @enyst ! I tried running #2512 for a couple of days to see if it was a better fix, and it looked promising. It went ~30 hours with no issues, but has now become stuck:

~/.openhands/conversations/129e7f493c6f499c881902714f3398e5$ cat base_state.json | jq '.execution_status'
"stuck"

I will make a backup of that conversation in case there is any useful metadata you want to extract from it

@DoubleDensity
Author

DoubleDensity commented Mar 26, 2026

it does seem to at least partially fix the problem; I was able to ask OpenHands to continue after it became stuck, and it resumed successfully. so #2512 is definitely an improvement, but it can still stop before reaching completion

@DoubleDensity
Author

with #2512 it resumed, but then stopped again after only 4 turns. not sure how to explain it, but not ideal

@enyst
Collaborator

enyst commented Mar 26, 2026

What was the reason for stopping again and did the LLM actually repeat itself?
